Some Linear-Time Algorithms for Systolic Arrays* 

Richard P. Brent ' H. T. Kung * Franklin T. Luk § 



Abstract 

We survey some recent results on linear-time algorithms for systolic arrays. In particular, we 
show how the greatest common divisor (GCD) of two polynomials of degree n over a finite 
field can be computed in time 0(n) on a linear systolic array of 0(n) cells; similarly for the 
GCD of two n-bit binary numbers. We show how n by n Toeplitz systems of linear equations 
can be solved in time O(n) on a linear array of 0(n) cells, each of which has constant memory 
size (independent of n). Finally, we outline how a two-dimensional square array of 0(n) by 
O(n) cells can be used to solve (to working accuracy) the eigenvalue problem for a symmetric 
real n by n matrix in time 0(nS(n)). Here S(n) is a slowly growing function of n; for practical 
purposes S(n) can be regarded as a constant. In addition to their theoretical interest, 
these results have potential applications in the areas of error-correcting codes, symbolic and 
algebraic computation, signal processing and image processing. For example, systolic GCD 
arrays for error correction have been implemented with the microprogrammable "PSC" chip. 

1 Introduction 

A "systolic array" is a regular array of simple machines or "cells" with a nearest-neighbour 
interconnection pattern. A pipeline is an example of a linear systolic array in which data flows 
only in one direction, but systolic arrays may be two-dimensional (rectangular, triangular or 
hexagonal) and data may flow between the cells in several different directions and at several 
different speeds. The concept of systolic arrays has recently been developed by H.T. Kung and 
his students [24, 35, 36, 37, 38, 43], although related ideas can be found in earlier work on 
models of computation [19, 29]. 

Systolic arrays may be implemented as synchronous or asynchronous systems, but for expository 
purposes we shall consider only synchronous systems. Systolic arrays are not necessarily fixed, 
special-purpose systems; they can be programmed [5, 21, 49, 57] or simulated by more general 
parallel machines [34, 55], although at some loss of efficiency. 



A "systolic algorithm" is a specification of the operation of each cell in a systolic array, together 
with a specification of the interconnection pattern of the array. Systolic algorithms have been 
suggested for solving many compute-bound problems, e.g. binary and polynomial arithmetic, 
convolution, filtering, matrix multiplication, solution of linear systems and least squares prob- 
lems, and geometric problems [6, 18, 25, 37, 39]. Here we survey some recent results on systolic 
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algorithms. The results are interesting because they show that systolic arrays can be used to 
solve certain important problems in linear (or almost linear) time; the problems considered have 
practical applications in areas such as error correction, symbolic computation, signal processing 
and image processing. 

The problems considered here are the computation of greatest common divisiors of polynomials 
(over a finite field) and of binary integers, the solution of Toeplitz systems of linear equations, 
and the solution of the symmetric eigenvalue problem. The first two problems require a linear 
array with uni-directional data flow (i.e. a pipeline), the third requires a linear array with bi- 
directional data flow, and the fourth requires a square (two-dimensional) array. The third and 
fourth problems require the use of floating-point arithmetic, and the fourth requires an iterative 
rather than a direct solution. The third and fourth problems also illustrate a common technique 
for converting a "semi-systolic" array (i.e. one with global broadcasting) into a true systolic 
array [43]. Because of space limitations we have had to omit many details, for which we refer 
the reader to the original papers [10, 11, 14, 15]. 



2 Polynomial GCD computation 

The polynomial GCD problem is to compute a greatest common divisor of any two nonzero 
polynomials. This problem is fundamental to algebraic and symbolic computations and to the 
decoder implemetations for a variety of error-correcting codes [9, 32, 46]. Many algorithms for 
solving the GCD problem are known [2, 7, 32]. However, for direct hardware implementation 
these algorithms are too irregular and/or too complex to be useful. For example, the classical 
Euclidean algorithm involves a sequence of divisions of polynomials whose size can only be de- 
termined during the computation. We shall describe some simple and regular systolic structures 
which can provide efficient hardware solutions to the GCD problem. 

In particular, we describe a systolic array of m + n + 1 cells which can find a GCD of any two 
polynomials of degrees m and n . Figure 1 illustrates that the coefficients of the given polyno- 

n m 

mials aiX % and bjX^ enter the leftmost cell and the output (their GCD) emerges from the 

1=0 j=0 

rightmost cell of the array. 

More precisely, if a unit of time is taken to be the cell cycle time (which is essentially the time 
required to perform a division or a multiplication and an addition), the 2(m + n + 1) time units 
after a n and b m enter the leftmost cell, the co-efficients of the GCD start emerging from the 
rightmost cell at the rate of one co-efficient per unit time. Unlike the systolic arrays described 
in Sections 4 and 5, the array illustrated in Figure 1 is a pipeline, as data flows through it in 
only one direction (although not necessarily at constant speed). 
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Figure 1: Systolic array for polynomial GCD 

The systolic arrays described in this paper are suitable for VLSI implementation [47] and can 
achieve high throughputs. The systolic polynomial GCD algorithms were developed in order to 
implement a decoder for Reed-Solomon error-correcting codes with the Programmable Systolic 
Chip (PSC) [21]. 
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Since it is not easy to understand some of the more complicated systolic algorithms, we shall start 
with the basic ideas and describe some simple algorithms first. Hopefully informal arguments 
will convince the reader that our algorithms are correct. Formal correctness proofs are beyond 
the scope of this paper. Nevertheless, every systolic algorithm mentioned below has been tested 
by simulation, using Pascal or Lisp programs on a serial computer, so we may have some degree 
of confidence in their correctness. 

2.1 GCD-preserving transformations 

All well-known algorithms for solving the polynomial GCD problem are based on the general 
technique of reducing the degrees of the two given polynomials by "GCD-preserving" trans- 
formations. A GCD-preserving transformation transforms a pair (A, B) of polynomials into 
another pair (A, B) such that a GCD of A and B is also a GCD of A and B, and vice versa. 
(We say "a GCD" because a GCD over a finite field is not generally unique.) When one of the 
two polynomials is reduced to zero by a sequence of such transformations, the other polynomial 
will be a GCD of the original two polynomials. We use this general technique, but choose very 
simple GCD-preserving transformations to permit their implementation by a systolic array. 

We assume throughout this section that the co-efficients of the polynomials belong to a fi- 
nite field. This is true for the decoder application for error-correcting codes; in [10] it is 
shown that straightforward modifications of our designs require no divisions and work over any 
unique factorisation domain. We define two GCD-preserving transformations, Ra and Rb ■ Let 
A = a.iX l + ■ ■ ■ + do and B = bjx^ + • • • 60 be the two polynomials to be transformed, where 
a,i 7^ and bj 7^ . 

Transformation Ra (for the case i — j > 0) : 

r i 

1 1 

A — *j I — - A = A — qx d B where d = i — j and at/bj . 



Transformation Rb (for the case i — j < 0) 



Rb \~~ - =A 

B — *n I — *- B = B — qx d A where d = j — i and q = bj/di 



It is obvious that both the transformations are GCD-preserving. Furthermore, Ra decreases the 
degree of A, i.e. degA < deg^4, and Rb decreases the degree of B , i.e. deg-B < degB . (For 
notational convenience we assume that the degree of the zero polynomial is — 1 .) 

2.2 Transformation sequence for polynomial GCD computation 

To compute a GCD of two given polynomials Aq and Bq of degrees n and m , we can apply a 
sequence of GCD-preserving transformations, each one being either Ra or Rb , until one of the 
two polynomials is transformed to zero; at this point the other (nonzero) polynomial is a GCD 
of Aq and Bq . We call this sequence of transformations the transformation sequence for Aq 
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and Bq , and denote it by (Ti, T2, . . . , T^) for some A;. Tj transforms (^4j_i,Sj_i) to (Ai,Bi) 
Note that the transformation sequence is uniquely defined for given Aq and Bq . 



An instructive way to view the function of the transformation sequence is to imagine that poly- 
nomials Aq and Bq move through the transformation "stages" Zi,T2, . . . , T^ from left to right, 
being transformed at each stage; when they emerge from the last stage Tfc , one will be the zero 
polynomial and the other will be a GCD of Aq and Bq . 

Suppose that transformation Tj reduces the sum of the degrees of its input polynomials Aj_i 

and Bi-x by <5j > . We call 8i the reduction value of Tj . Since the sum of the degrees of Aq 

k 

and Bq at the beginning of the GCD is n + m , we have ^ Sj < n + m + 1 . 

i=i 

2.3 A systolic array for polynomial GCD computation 

We now specify a systolic array of n + m + 1 cells which can compute a GCD of any two input 
polynomials Aq and Bq (not both zero) of degrees no more than n and m , respectively. 

Consider the transformation sequence (Ti, . . . , T^j for Aq and So • For each i = 1, . . . , k , trans- 
formation Tj can be realised by a subarray of Si cells, where <5j is the reduction value of Tj . 

fc 

Since ^ <5j < n + m + 1 , a systolic array with n + m + 1 calls can realise all the transformations. 
i=i 

This is illustrated in Figure 2. 
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Figure 2: (a) Transformation sequence, and 

(b) its realisation by three concatenated systolic subarrays 



2.3.1 The basic idea for realising a single transformation 

Let T be any transformation in the transformation sequence (Ti, . . . , Tfc) , and 5 its reduction 
value. We illustrate how a subarray with 5 cells can realise T , assuming that we know which of 
Ra and Rb the transformation T is (see Section 2.3.2 below). We consider the case when T is 
Ra ; the case when T is Rb can be treated similarly. Without loss of generality, we can assume 
that T transforms (A, B) to (A,B) = (A - qx d B, B) where 

A = aiX % -\ h ao , B = bjx j -\ h 6 , 

a,i / , bj , q = ai/bj , and d = i — j > . 

Note that either A = or A = ai~$x l ~ 5 + ■ ■ ■ + Tlq , where ais / . The systolic subarray for 
realising T is shown in Figure 3. 
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Terms of A and B move through the subarray in a serial manner, high degree terms first (there 
is a dual with low degree terms first). The nonzero leading terms of A and B are aligned so 
that they enter the leftmost cell of the subarray during the same cycle. Besides the systolic data 
paths for a and b, there is a 1-bit wide systolic control path, denoted by start; a true (i.e. 1) 
value on this path signals to a cell the beginning of a new GCD computation in the following 
cycle. In Figure 3 and below, 1-bit wide systolic control paths and associated registers are shown 
by dotted arrows and boxes. 
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if start then 
begin 

q := ain/bin; 

aout := {pad in zeros for vanishing terms in A} 
end 

else aout := ain - q*bin; 

bout := b; b := bin; {it takes 2 cycles for each b to pass a cell} 
startout := start; start := startin. 

Figure 3: Systolic subarray and its cell definition for realising a transformation Ra 

It is easy to see that the leftmost cell performs q := at/bj in the first cycle and computes terms 
of A in subsequent cycles. The qs computed by other cells are always zero, since terms of A that 
have degree higher than % — b are zero. The only function of these cells is to shift the coefficients 
of A faster than those of B (notice that each coefficient of B stays in each cell for two cycles). 
Thanks to these "shifting" cells the nonzero leading term His of A will emerge from the right- 
most cell at the same cycle as bj , the nonzero leading term of B . Thus ais and bj are aligned to 
enter another subarray of cells to the right in order to realise whatever transformation follows T . 

Note that there is no need to keep track of the value of 5 in the systolic subarray. If A is nonzero, 
the realisation of the transformation following T starts automatically at the first cell that sees 
a nonzero input (i.e. ais) on it input line (denoted by ain). If A is the zero polynomial then T 
must be the last transformation Tk In this case, the coefficients of B will continue being shifted 
to the right to be output from the rightmost cell, and they will form terms in the desired GCD. 



5 



2.3.2 A design using the difference of degrees 
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dout := d; 

start out := start; 

case state {possible states are initial, reduceA and reduceB} of 

initial: {wait for the beginning of a GCD computation} 
begin 

aout := a; bout := b; 
if start then 
begin 

if (ain = 0) or ((bin ^ 0) and (din > 0)) then 
begin 

state := reduceA; 

if bin = then q := else q := ain/bin; a := 0; 
b := bin; d := din — 1 
end 
else 
begin 

state := reduceB; q := bin/ain; b := 0; 
a := ain; d := din + 1 
end 
end 
end; 

reduceA: {transform A and shift a's faster than b's} 
begin 

if startin then state := initial; 

aout := ain — q*bin; bout := b; b := bin; 

d := din 

end; 

reduceB: {transform B and shift b's faster than a's} 
begin 

if startin then state := initial 
aout := a; a:= ain; bout := bin — q*ain; 
d := din 
end 
end; {case} 
start := startin. 

Figure 4: Cell definition for the algorithm using differences of degrees 

We have seen that a systolic subarray with cells defined as in Figure 3 can realise the trans- 
formation T if it is known whether T is Ra or Rb ■ Let d = deg A — deg B , where A and B 
are the polynomials to be transformed by T . Then T is Ra if d > 0, otherwise T is Rb (see 
Section 2.1). The cell definition of Figure 4 keeps track of the value of d , and consequently it is 
able to determine on the fly which transformation to perform. As in Figure 3, we specify the cell 
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using a Pascal-like language. There are three states; operations performed by each cell during 
a cycle depend on the state of the cell. Initially, every cell is in state inital. Triggered by the 
start signal a cell will go to one of the other two states (reduceA or reduceB) and eventually 
return to state initial. 

To illustrate the definition, consider once more the systolic subarray depicted in Figure 3. Sup- 
pose that d = i — j > and bj ^ . Marching to the right together with bj is the current 
value of d. Each cell upon receiving a true value on the systolic control part start will go to 
state reduceA (since d > 0). When ais 0) and bj are output from the rightmost cell of the 
subarray, they will enter the cell to the right in the following cycle with state reduceA if d > 
or reduceB if d < . 

With m + n + 1 cells a systolic array based on this design can compute a GCD of any two 
polynomials of total degree less than m + n + 1 . Moreover, immediately after the input of one 
pair of polynomials, a new pair of polynomials can enter the systolic array. That is, the systolic 
array can compute GCDs for multiple pairs of polynomials simutaneously, as they are pipelined 
through the array. 

We assume that none of the given pairs of polynomials have x as a common factor, so their 
GCDs have nonzero constant terms. (A common power of x can easily be factored out before 
the computation.) With this assumption, one can tell what the GCD is from the output emerging 
from the rightmost end of the array. The constant term of the GCD is the last nonzero term 
emerging from the array before output of the next batch of polynomials commences, and the 
high degree terms of the GCD are those terms which emerged earlier on the same output line. If 
it is inconvenient to assume that the GCDs have nonzero constant terms, one can either keep the 
degrees explicitly (instead of just their difference) or have a "stop" bit to indicate the location 
of the leftmost of ao and &o • 

2.4 Some extensions 

The "extended" GCD problem is to find not only a greatest common divisor of G of Aq and Bq , 
but also polynomials U and V such that UAq + VBq = G . The extended GCD problem is 
important for many applications, including the decoder implementation for a variety of error- 
correcting codes. The systolic array described above can be modified to compute U and V : 
see [10] for details. 

By keeping track of the beginning and end of each polynomial during the computation, it is 
possible to avoid explicitly using the difference of degrees of the polynomials (and hence no upper 
bound on this difference need be known when the cells are designed). Also, by interchanging 
A and B as necessary, we can ensure that the output GCD always emerges on a fixed output 
line. These modifications lead to systolic algorithms whose implementations require fewer I/O 
pins, which is an important practical consideration. The cell definition for one such algorithm 
is given in Appendix A: by interchanges it ensures that d > , and d is represented in unary as 
the distance between 1-bits on the start and sig control paths. 
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3 Integer GCD computation 



Consider now the problem of computing the greatest common divisor GCD (a, b) of two pos- 
itive integers a and b, assumed to be given in the usual binary representation. Our aim is to 
compute GCD (a, b) in time O(n) on a linear systolic array, where n is the number of bits 
required to represent a and b (say a < 2 n , b < 2 n ). The significant difference between integer 
and polynomial GCD computations is that carries have to be propagated in the former, but not 
in the latter. 



The classical Euclidean algorithm [32] may be written as: 

b 



while b ^ do 



a mod b 



; GCD := a 



This is simple, but not attractive for a systolic implementation because the division in the inner 
loop takes time Q(ri) . More attractive is the "binary" Euclidean algorithm [8, 32] which uses 
only additions, shifts and comparisons. 

{assume a, b odd for simplicity} 
t := la - b| ; 
while t / do 
begin 

repeat t := t div 2 until odd(t); 
if a > b then a := t else := t; 
t := la - b| 
end; 
GCD := a. 



However, if we try to implement the binary Euclidean algorithm on a systolic array we encounter 
a serious difficulty: the test " if a > b ..." may require knowledge of all bits of a and b, so 
again the inner loop takes time tt(n) in the worst case. 



3.1 Algorithm PM 

{assume a odd and b ^ , I a. I < 2 n , |b| < 2 n } 
a : = n; (3 : = n; 
repeat 

while even(b) do begin b := b div 2; (3 := (3—1 end; 
{now b odd, |b| < 2 P } 

if a > (3 then begin swap (a, b) ; swap (a, (3) end; { "swap" has obvious meaning} 
{now a < (3 , I a I < 2" , |b| < 2^, a odd, b odd} 
if ((a+b) mod 4) = then b := (a+b) div 2 else b := (a-b) div 2; 
{now b even, |b| < 2 /3 } 
until b = 0; 
GCD : = I a I . 



Figure 5: Precursor to Algorithm PM 

Algorithm PM (for "plus- minus" ) , like the classical and binary Euclidean algorithms, finds the 
GCD of two n-bit integers a and b in 0(n) iterations, but we shall see that it can be imple- 
mented in a pipelined fashion (least significant bits first) on a systolic array. Before defining 
Algorithm PM we consider the "precursor" algorithm defined in Figure 5. Using the assertions 
contained in braces, it is easy to prove that the algorithm terminates in at most 2n + 1 iterations 
(since a + (3 strictly decreases at each iteration of the repeat block, except possibly the first). 
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Moreover, since all transformations on a and b are GCD-preserving, the GCD is computed 
correctly. 

It is not necessary to maintain a and j3: all we need is their difference 8 = a — f3 (analogous 
to the difference of degrees in Section 2). This observation leads to Algorithm PM, which is 
defined in Figure 6. 

{assume a odd, b ^ 0} 
d := 0; 
repeat 

while even(b) do begin b := b div 2; d := 8 + 1 end; 
if 8 < then begin swap (a, b) ; 8 := —S end; 

if ((a+b) mod 4) = then b := (a+b) div 2 else b := (a-b) div 2 
until b = 0; 
GCD : = I a I . 



Figure 6: Algorithm PM 



3.2 Implementation on a systolic array 

For implementation on a systolic array, Algorithm PM has a great advantage over the classical 
or binary Euclidean algorithms: the tests in the inner loop involve only the two least-significant 
bits of a and b . Hence, a cell can perform these tests before the high-order bits of a and b reach 
it via cells to its left. (The termination criterion "until b = 0" is not a problem: see below.) 

We have to consider implementation of operations on 8 in Algorithm PM. The only operations 
required are U S := 5 + 1", "5 := —5", and "if 8 > . . . ". Rather than represent 8 
in binary, we choose a "sign and magnitude unary" representation, i.e. keep sign (8) and \8\ 
separate, and represent e = \8\ in unary as the distance between 1-bits in two stream of bits. 
With this representation all required operations on 8 can be pipelined. 
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Figure 7: Systolic cell for integer GCD computation 



After some optimisations we obtain the systolic cell illustrated in Figure 7 and defined in 
Appendix B. The cell has six input streams (each one bit wide): ain and bin for the bits of the 
numbers a and b represented in 2's complement binary (least significant bit first), startin to 
indicate the least significant bit of a, and three additional streams startoddin, epsin and negin 
which should be all zero on input to the leftmost cell, startoddin is used to indicate the least 
significant 1-bit of a or b, epsin and negin are used to represent 8 . There are six corresponding 
output streams (connected, of course, to the input streams of the cell to the right). The cell has 
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twelve internal state bits: one for each of the six inputs and six additional bits (wait , shift , 
carry, swap, eps2 and minus). 

The termination criterion (b = 0) need not be checked because once b is reduced to zero, cells 
further to the right will implement the statement "begin b := b div 2; d := d + 1 end" 
(see Figure 6) and transmit a unchanged, so the correct result will emerge from the rightmost 
cell. All we need is at least An cells to guarantee that Algorithm PM has reduced b to zero. 
Actually, 3.1106n + 1 cells suffice: see [11]. Note that the final output may represent — GCD 
(a,b) in 2's complement: an additional 0{n) cells are required to ensure that the output is 
+GCD (a,b). 

The definition of the cell illustrated in Figure 7 is given in Appendix B. It implements Algorithm 
PM (see Figure 6) with a straightforward modification to allow even inputs as well as odd. 



4 Solution of Toeplitz systems 

A Toeplitz matrix A = (a^) is one which is constant along each diagonal, i.e. aij is a function 
of j — i (which we denote by Oj_j). We are interested in the solution of a Toeplitz system of 
linear equations Ax = b , 



where A = (a,j-i) 
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x ri 



(It is convenient to consider (n + l)-vectors and (n + 1) by (n + 1) Toeplitz matrices, with in- 
dices running from to n .) Large Toeplitz systems often arise in filtering and signal processing 
applications [1, 40, 56]: values of n greater than 1000 are common, so it is important to have 
special algorithms which take advantage of the Toeplitz structure. In applications A is often 
symmetric positive-definite, but we do not assume this here. 

Several serial algorithms which require time 0(n 2 ) are known for the solution of Toeplitz systems, 
for examples see [3, 31, 44, 58, 62]. Serial algorithms requiring time 0(n log 2 n) and space 0(n) 
are also known [4, 9, 48], although their practical value is not yet clear [54]. 

4.1 Systolic algorithms for Toeplitz systems 

It is natural to ask if a linear systolic array of 0(n) cells can be used to solve a Toeplitz system 
in time 0(n) . This is indeed the case [1, 41, 42, 49], but the systolic algorithms presented in 
the cited papers have two shortcomings. 

1. They assume that A is symmetric, and 

2. they use VL{n 2 ) storage, i.e. f2(n) words per cell. 

We shall outline a systolic algorithm which avoids both these shortcomings: it applies to un- 
symmetric Toeplitz systems (although it may be specialised to the symmetric case with some 
savings if desired), and the total storage required is 0(n) words, i.e. only a constant per cell. 
(We consider words of storage rather than bits: a word is assumed to be large enough to hold 
one floating-point number or an integer of size 0(n) , although the latter requirement could be 
avoided.) 
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4.2 The algorithm of Bareiss 

Our systolic architecture uses an algorithm of Bareiss [3] to compute (implicitly) an LU factori- 
sation of the Toeplitz matrix A . Historically the Bareiss recursions in the symmetric case are 
due to Schur [53]. 

Define the "shift" matrix Z k by Z k = (z^ ) , 

l i 1 otherwise . 

Thus, premultiplication of A by Z k shifts the rows of A up k places with zero fill. The Bareiss 
algorithm is defined in Figure 8. At a typical stage of the Bareiss algorithm, the matrices A^~ k ^ 
and A^ +k "> have the structure illustrated in Figure 9 (k = 0, . . . , n) . 

The Bareiss algorithm computes the same LU factorisation of A as would be obtained by Gaus- 
sian elimination without pivoting: U = A^~ n ^ and a$L = ( J 4(+ ra )) T2 , where the superscript 
"T2" denotes reflection in the main antidiagonal. It is assumed that all leading principal mi- 
nors of A are nonsingular, so the LU factorisation of A exists. By transforming the right-hand 
side as shown in Figure 8, we obtain an upper triangular system A^~ n ^x = b(~ n ^ , so A^ +n ^ may 
be discarded if our objective is merely to solve Ax = b. 

Because A^~ n ^ is not Toeplitz, the Bareiss algorithm appears to require 0(n 2 ) storage. However, 
at the expense of some extra computation, we can get by with 0{n) storage. The key idea is 
that we can run the Bareiss algorithm backwards to regenerate the elements of A^~ n ^ as they 
are required to solve the triangular system A^~ n ^x = bf-~ n ^ by "back-substitution" [60], using 
the Toeplitz structure of /3_ and £ (see Figure 9) and the equations 

A (k-i) = A^ + m k Z k A(-V 

and 

A (i-k) = A (-k) + m _ kA _ kA {k~i) 



A<°> := A; b<°> := b; 
for k := 1 to n do 
begin 

m_fc := aj^Q^/ao; {diagonal elements of A^ fe— 1 ) equal ao} 

A (-fe) . = A (i-fc) _ m _ fc z_ fc A (fe_1) ; {only store g± and /3 : see Figure 9} 

b(-fe) := b(i-fe) - m.feZ.feb^- 1 ); 

A (+ fe ) : = A( fe_1 ) — m_|_ fc Z_|_ fc A( _fe ' ; {only store 7 and 8 : see Figure 9} 

b<+ fe ) := b^" 1 ) - m +fe Z +fe b(- fe ) 

end. 

{now A( - "')x = b(~ n ^ is an upper triangular system} 

Figure 8: The Bareiss algorithm 
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k rows upper triangular, not Toeplitz 

n + 1 — k rows upper triangular, Toeplitz (/3) 

k zero diagonals 

n — k rows lower triangular, Toeplitz (a) 

n — k rows upper triangular, Toeplitz (5) 
k zero diagonals 



lower triangle, top n + 1 — k rows Toeplitz (7) 
bottom k rows, not Toeplitz 



Figure 9: Structure of A^ fc ) and in the Bareiss algorithm 

for k = n, n — 1, . . . , 1 . (Observe that row k of A^~ n ^ is equal to row k of A^~ k ^ .) Hence, only 
0(n) storage is required to regenerate rows n, n — 1, . . . , of A^~ n ^ : we need only to save the 
multipliers m±k and the last column of A^~ n ^ . In the systolic algorithm described below these 
0(n) numbers are simply saved in the appropriate systolic cells. 

A different way of reducing the storage requirements to 0(n) is to use the Gohberg-Semencul 
formula [9] for the inverse of A , but the method outlined above is simpler and can take advantage 
of any band structure in A [15]. 

4.3 A systolic algorithm for the solution of Toeplitz systems 

In the Bareiss algorithm four triangular Toeplitz matrices are updated (see Figures 8 and 9). 



a 



7 





We use a linear array of cells Pq , Pi , . . . , P n where cell Pk has registers to store , /3& , 7^ and 5k ■ 
(When describing cell Pk we omit the subscripts and simply refer to registers a,/3,7 and 5.) 
Cell Pk requires four additional registers: for a multiplier m_j , [ik for a multiplier m + j , and 
£k and rjk which are associated with the right-hand side vector b and the solution x . 
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Phase 1: LU decomposition by the Bareiss algorithm 



Ph- 



ot , 5 , £ 



ft 



a , 5 , £ 



/'/, • 



Phase 2: Back substitution to solve triangular system 



ft 



fc-1 



ft 



7i Mi *) 



ft 



fc+i 



Figure 10: Data flow for systolic Toeplitz solver 

Data flows in both directions between adjacent cells as shown in Figure 10. Each cell needs 
five input and five output data paths, denoted by inLl, inL2, inRl , inR2, inR3, outLl, 
outL2, outL3, outRl and outR2 (see Figure 11). 




Figure 11: Cell for systolic Toeplitz solver 

To avoid broadcasting multipliers A and u during Phase 1, we use a common technique [14, 37, 
43]: cells are active only on alternate time steps (ft), ft 5 • • • at time T = 0, 2, . . . and ft, ft . . . at 
time T = 1, 3, . . .) , and the operation cell is delayed by k time steps relative to the operation 
of cell ft) • A similar technique is used during Phase 2, to avoid broadcasting 5 and £ . (For 
another example of this technique, see Section 5.3.) 



Initialisation is as follows: 



for k := 1 to n do 
begin 

a k := a_(fc + i); (3 k := a fc ; *y k := a_ fc ; S k := & k+1 ; 
Afc := 0; fi k := 0; £ fc := b n _ k _i ; r] k := b„,_ fc ; 

{we assume that a_(„_|_ 1 ) = a„_|_i = b_i = to cover end-conditions} 
end. 



Clearly this can be done in time 0(n) if A and b are available at either end of the systolic array. 

The definition of cell ft(0 < k < n) is given in Appendix C. The final solution x is given 
by Xfc = £fc , where ^ is stored in register £ of processor ft after step 4n . We make some 
observations concerning the definition; for further details see [15]: 

1. Cell ft is active only if k < T < 2n - k (Phase 1) or 2n + k < T < An - k (Phase 2). It is 
assumed that cell ftc knows its index k and the current value of T (though this could be 
avoided by the use of 1-bit systolic control paths as in Section 3). 
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2. Pairs of adjacent cells could be combined, since only one cell of each pair is active at each 
time step. This would increase the mean cell utilisation from 25% to 50% (see observation 
1 above). 

3. Cell Pq performs floating-point divisions, other cells perform only additions and multi- 
plications. The total number of multiplications is 4.5n 2 + 0(n) . A time step has to be 
long enough for six floating-point additions and multiplications, plus data transfers, dur- 
ing Phase 1 (three during Phase 2); these may be performed concurrently (with trivial 
modifications to the cell definition in Appendix C) provided Po is sufficiently fast. 

4. If at = ^ VjVj+k for some data yj , as is common in applications [40, 44], the coefficient 

3 

ak can be computed in place by cell Pk . 

5. Simplifications are possible in the symmetric case (a& = a-k) ■ 

6. The algorithm is numerically unstable in the general case, because it involves the LU 
factorisation of A without pivoting. In fact, the algorithm breaks down if a principal 
minor of A is singular (e.g. if ao = 0). However, in applications A is often diagonally 
dominant or positive definite (see observation 4 above). For further discussion of the 
numerical properties of related algorithms, see [20, 56]. Sweet [56] gives an 0(n 2 ) time 
(serial) algorithm which computes an orthogonal factorisation of A and is numerically 
stable, but we do not know if it can be implemented in time 0(n) on a systolic array of 
0(n) cells. 

7. Cell Pfc typically reads its input lines inLl, . . . ,inR3, does some floating-point computa- 
tions, and then writes to its output lines outLl, . . . ,outR2. Hence, pairs of lines could 
be combined into single bidirectional lines (e.g. inLl and outLl could be combined). 



5 The Symmetric Eigenvalue Problem 

In this section we consider the problem of computing the eigenvalues (and, if required, the eigen- 
vectors) of a real symmetric n by n matrix A , using a systolic algorithm. Unlike the problems 
considered above, this problem must be solved by an iterative method. Several authors [6, 28, 
52] have suggested the use of the QR algorithm, but this does not seem particularly well-suited 
to parallel computation. Instead, we resurrect the old-fashioned method of Jacobi [30], since 
it is possible to implement it efficiently on a systolic array. Sameh [51] suggested the use of 
Jacobi's method on a parallel computer; the idea of permuting rows and columns (as directed 
in Section 5.2) to avoid global communication requirements was first suggested by Brent and 
Luk [14]. 

A "sweep" is defined in Section 5.1 below. Suppose that the Jacobi method requires S sweeps 
for convergence to working accuracy. For random symmetric matrices A it is conjectured [16] 
that S = O(logn); in practice S < 10 for all reasonable values of n [14, 16, 50]. We sketch 
how a sweep can be performed in time 0(n) on a square array of [n/2] by [n/2] systolic cells, so 
the symmetric eigenproblem can be solved in time 0(nS) . The reader is referred to [13, 14] for 
many details which are omitted here because of space limitations. 
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5.1 The serial Jacobi method 



The Jacobi method generates a sequence {A^} of symmetric matrices by A^+i := R^AkRk, 
where Rk is a plane rotation and A\ = A. Let Rk = (rpq*) > = j an d suppose that i?*. 

represents a rotation in the plane, with i < j . We have 



r r {k) 

ii 
-(*) 


-(*) " 
ij 




-(*) 






ii - 





cos 6 sin 
- sin 6 cos 6* 



where the angle 6 is chosen so as to reduce the element of Ak+i to zero. The formulae used 
to compute sin# and cos# are given by Rutishauser [50]. The matrix Ak+i differs from Ak only 
in rows and columns i and j . 

There remains the problem of choosing , which is usually done according to some fixed 
cycle. It is desirable to go through all the off-diagonal elements exactly once in any sequence 
(called a "sweep") of n(n — l)/2 rotations. A simple sweep consists of the cyclic-by-rows or- 
dering (1, 2), (1, 3), . . . , (1, n), (2, 3), . . . , (2, n), (3, 4), . . . , (n - 1, n) . Forsythe and Henrici [23] 
prove that the cyclic-by-rows Jacobi method always converges if \9\ < tt/4 (which can always be 
enforced). The serial Jacobi method enjoys ultimate quadratic convergence [59]. 

Unfortunately, the cyclic-by-rows scheme is apparently not amenable to parallel processing. In 
Section 5.2 we represent an ordering which enables us to do \n/2\ rotations simultaneously. 
The (theoretical) price is the loss of quaranteed convergence. Hansen [26] defines a "preference 
factor" when comparing different orderings for the serial Jacobi method. Our new ordering is in 
fact quite desirable, even for serial computation, for it asymptotically optimises the preference 
factor as n — > oo . Thus, although the convergence proof of [23] does not apply, we expect 
convergence in practice to be faster than for the cyclic-by-rows ordering, and simulation results 
[14, 16] support this conclusion. To ensure convergence we may adopt a "threshold" approach 
[60]: associate a threshold value with each sweep, and when making the transformation of that 
sweep, omit any rotation if the doomed off-diagonal element is smaller in magnitude than the 
threshold. 



5.2 A semi-systolic algorithm for the symmetric eigenvalue problem 

We first describe a semi-systolic algorithm for implementing the Jacobi method. The algorithm 
is semi-systolic rather than systolic because it assumes the ability to broadcast row and column 
rotation parameters (i.e. sin 6, cos 6 values). In Section 5.3 we show how to avoid broadcasting. 

Assume for simplicity that n is even. We use a square array of n/2 by n/2 systolic cells, each 
cell containing a 2 by 2 submatrix of A . Initially cell Py contains 

a2i-l,2j-l a-n-i,2j 
0-2i,2j-l a 2i,2j 

for i,j = 1,..., n/2, and Pij is connected to its nearest neighbours Pi±ij and Pij±i ■ 



In general P^ contains four real numbers 
by symmetry of A . 



Pij 

Si-, 



where = , <% = 5ji and = jji 
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The diagonal cells Pa (i = l,...,n/2) act differently from the off-diagonal cells P^ 

(1 <i,j< n/2, i / j). At each time step the diagonal cell Pa computes a rotation 

to annihilate its off-diagonal elements fin and 7^ (actually fin = 7^) and update its diagona 
elements an and 5u accordingly. To complete these rotations, the off-diagonal cells Pij (i 7^ j) 
must perform the transformations 



Otij 










OCij 










Sij 








lij 









We asume that the diagonal cell Pa broadcasts the rotation parameters q and Sj to cells Pij 
and Pji (j = 1, . . . , n/2) in the same row and column. 



To complete a step, columns and corresponding rows are interchanged between adjacent cells so 
that a new set of n off-diagonal elements is ready to be annihilated by the diagonal cells during 
the next time step. The interchanges are done in two sub-steps. First, adjacent columns are 
interchanged according to the permutation 

P = (357. . . (2n - l)(2n)(2n - 2)(2n - 4) . . . 42) . 

Note that this is not the "perfect shuffle" permutation; it is a permutation used in the singu- 
lar value decomposition algorithm of [13], and only requires nearest-neighbour communication 
between systolic processors. Next, the same permutation P is applied to the rows, to maintain 
symmetry. From Section 3 of [13] it is clear that a complete sweep is performed every n — 1 
steps, because each off-diagonal element of A is moved into one of the diagonal cells in exactly 
one of the n — \ steps. This is illustrated for the case n = 8 in Figure 12. 
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Figure 12: Indices of off-diagonal elements in diagonal cells over a full sweep (n = 8) 
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5.3 Further details - avoiding broadcast of rotation parameters 

In [14] details such as the threshold strategy, computation of eigenvectors, use of diagonal con- 
nections between cells, taking full advantage of symmetry, handling the case of odd n , etc are 
discussed. Here we omit these details, but outline an important point: how to avoid broadcast 
of rotation parameters, i.e. to convert the semi-systolic algorithm of Section 5.2 into a systolic 
algorithm, while retaining total running time 0(nS) for the algorithm. 

Let Ajj = \i — j\ denote the distance of cell Pij from the diagonal. The operation of cell Pij will 
be delayed by Ajj time units relative to the operation of the diagonal cells, in order to allow 
time for rotation parameters to be propagated at unit speed along each row and column of the 
systolic array. 

A cell cannot commence the computations associated with a rotation until data from ear- 
lier rotations is available on its input lines. In particular, cell P^ needs data from cells 
Pj_ij_i,Pj_ij + i,Pj + ij_i and Pj+ij+i for 1 < i < n/2,1 < j < n/2 (the boundary cases 
are slightly different). Since 

|Ajj — Aj + ij + i| < 2 , 

it is sufficient for cell Pij to be idle for two time steps while waiting for the cells Pi±i j±i to 
complete their (possibly delayed) steps. Thus, the price paid to avoid broadcasting rotation 
parameters is that each cell is active for only one third of the total computation time. A similar 
inefficiency occurs in many other systolic algorithms, see for example [6, 12, 35, 37, 43] and 
Section 4.3. In a practical design triples of three adjacent cells could share a floating-pont unit 
to ameliorate this inefficiency. Alternatively, "idle" cells could be used to increase the reliability 
of the systolic array by performing redundant computations [33] . 

5.4 Some extensions 

We have sketched how the symmetric eigenvalue problem can be solved in time 0{nS) , where 

5 is for practical purposes bounded by 10, using a square array of 0(n 2 ) systolic processors. 
The speedup over the usual 0(n 3 ) serial algorithms (e.g. tridiagonalisation followed by the QR 
algorithm) is significant for moderate or large n . Related algorithms for computing the singular 
value decomposition on a systolic array are presented in [13, 17]. For the unsymmetric eigenvalue 
problem the question is open - the ideas used in the symmetric case do not all carry over to 
Eberlein's methods [22, 51] in an obvious way. However, everything does carry over with the 
obvious changes to complex Hermitian or normal matrices. 

6 Conclusion 

Systolic arrays provide cost-effective solutions to many important compute-bound problems, 
although they are not a universal panacea. The examples presented in Sections 2-5 illustrate 
that the best serial algorithm does not always lead to the best systolic algorithm. A systolic 
array with n cells can simulate (in real time) a Turing machine which uses at most n squares of 
tape, but a "good" systolic algorithm should be significantly faster than a simulation of a Turing 
machine. There are many problems for which the existence of a good systolic algorithm remains 
an open question. Other open questions are: how to compile code for a programmable systolic 
array [21], how to prove the correctness of cell definitions such as those given in Appendices 
A-C, and how best to implement the systolic cells. For example, should they use the bit-serial 
approach advocated in [1, 45, 47] or the bit-parallel approach of [21] ? 
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Appendix A: Cell definition for systolic polynomial GCD 
computation 

{The language used here and below is Pascal with some trivial extensions. To save space, 
obvious declarations have been omitted.} 

aout := a; a := ain; {standard transfers} 

bout := b; b := bin; {assume deg B < deg A} 

startout := start; start := startin; {true for start of polynomial A } 
stopout := stop; stop := stopin; {true for end of polynomial A } 
sigout := sig; sig := sigin; {initially sig true if corresponding b ^ 0} 

case state {possible states are initial, shift, swap and trans} of 
initial: {wait for next start signal} 
if start and not stop then 

if b = then state := shift else 

begin q := a/b; {division can be avoided} 
if sig then 

begin state := swap; a := b; sig := false end 
else state := trans 
end; 

shift: {shift B faster than A } 
begin bout := b; b := 0; 
if stop then state := initial 
end; 

swap: {transform, shift and interchange} 
begin bout : = a - q*b ; a : = b ; b : = ; 
sig : = (bout 7^ 0) ; 
if stop then state := initial 
end; 

trans: {transform, shift} 

begin aout := a - q*b; a := 0; 

if stop then state := initial; 

stopout := stop; stop := false; 

sigout := sig; sig := false 

end 
end {case}. 
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Appendix B: Cell definition for systolic integer GCD computation 

{See Figure 7 for I/O ports} 

aout := a; a := ain; {standard transfers} 

bout := b; b := bin; 

start out := start; start := start in; 

startoddout := startodd; startodd := startoddin; 

epsout := eps2; eps2 := eps; eps := epsin; {delay here} 

negout := neg; 

wait := (wait or start) and not startodd; {wait for nonzero bit} 

if startodd or (wait and (a or b)) then 
begin 

eps := eps or wait; 

eps2 := 0; {0 = false, 1 true} 

neg := negin and not wait; 
startodd := 1; 

wait := 0; {end of waiting for a nonzero bit} 

swap := not a; 

shift := not (a and b) 

end 

else if wait then epsout := eps2 {normal speed} 
else if shift then {shift b faster than a, may also swap} 
begin 

aout := (bout and swap) or (aout and not swap); {normal speed} 

bout := (a and swap) or (b and not swap); {fast speed} 

epsout := (eps and neg) or (epsout and not neg); 

neg := neg and not (eps and startoddout); {d may become zero} 

negout : = neg 

end 

else if startoddout then 
begin 



epsout := eps2; 


{normal speed} 




swap := not neg; 






neg := neg or not eps2; 


{d := -\6\} 




negout := neg; 






aout : = aout or swap ; 


{swap implies b} 




bout := 0; 


{and new b is even} 




carry := a © b; 


{may be borrow or carry; © is exclusive 


or} 


minus : = not carry 


{1 iff we form (b - a) div 2} 




end 






else {not startoddout} 






begin 






epsout := eps2; 


{normal speed} 




aout : = (bout and swap) or 


(aout and not swap) ; {normal speed} 




bout := a © b carry; 


{fast speed} 




carry := majority (b, carry, 


a © minus) {majority true if 2 or 3 of its 


arguments true} 


end. 
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Appendix C: Cell definition for systolic Toeplitz equation solver 



{Program for cell k at time step T, 0<k<n, 0<T< 4n} 
{See Figure 11 for I/O ports} 

if even(T+k) and (T > k) and (T<2n-k) then {Phase 1 - LU factorisation} 
begin 

if T > k then {accept inputs from cell k+1} 

begin a. : = inRl; 8 := inR2; £ := inR3 end; 
if k = then {compute multiplier} A := cn/7 else 

begin {accept multipliers from cell k-1} 

A := inLl; /_t : = inL2; 

a := a - A * 7 

end; 

j3 := j3 - A * 8; 77 := 77 - A * £ ; 

if k = then {compute multiplier} fi := 8 I (3 else 
begin 

7 := 7 - fi * a ; 
S := 8 - fi * /3 ; 
£ := $ - n * 77 
end; 

outLl := a ; outL2 := 8 ; outL3 := £ ; {ignore outLl-3 if k = 0} 
outRl := A ; outR2 := fi {ignore outRl-2 if k = n} 

end 

else if even(T + k) and (T > 2n+k) and (T < 4n-k) then {Phase 2 - back substitution} 
begin 

if T > 2n+k then begin A := inRl; fi := inR2; 77 := inR3 end; 
if k = then begin £ := rj / (3; 8 := fi * (3 end else 
begin 

£ := inLl; 8 := inL2; 

r) := 77 - (3 * £ ; 8 := 8 + /j * f3 

end; 

(3 := (3 + A * 8 ; 

outLl := A ; outL2 := fi ; outL3 : = 77 ; {ignore if k = 0} 
outRl := £ ; outR2 := 8 {ignore if k = n} 

end. 
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